RNA, SCFGs and Classifiers

Authors

  • Zsuzsanna Sükösd
  • Paula Tataru
  • Jotun Hein
Abstract

We developed, implemented and tested a stochastic context-free grammar (SCFG)-based method for classifying RNA molecules into structural classes, by training grammars to simultaneously recognize certain types of RNAs and disrecognize other types. We tested our program on datasets obtained from thermodynamic (RNAfold) structure predictions for 6000 tRNA and 6000 miRNA sequences, and we believe our program serves as a "proof of principle" for the structural classification of RNAs. We expect that incorporating evolutionary information from sequence alignments would significantly improve structure prediction. Furthermore, our method can naturally be extended to involve a larger number of grammars.

Introduction

In recent years, considerable interest has developed in elucidating the relationships between RNA sequence, structure and function, especially with regard to newly discovered roles of RNAs in cells and viruses. The computational prediction of RNA structures plays a vital role in bridging the gap between sequencing and experimental structure determination, and also facilitates the interpretation of experimental data.

In 1994, stochastic context-free grammars (SCFGs) were introduced in two papers (Durbin and Eddy, 1994 and Sakakibara et al., 1994) to describe and analyze RNA structures. Knudsen and Hein (1999, 2003) coupled an SCFG to molecular evolution and thus created a comparative predictor.

Most RNA gene predictors try to classify a sequence into two classes: "RNA gene" and "background functionless DNA". However, it is often of interest to classify an RNA gene/structure into functional classes. Typically, this is done either by homology (if two sequences are similar, they probably have the same function) or by some additional classifier algorithm (Batuwita and Palade, 2009 & Klingelhoefer, Moutsianas and Holmes, 2009).
In this project, we attempted to classify RNA molecules into structural classes, by training SCFGs to recognize some types of RNAs while at the same time disrecognizing others.

Our stochastic context-free grammar

RNA sequences are long chains built from 4 types of nucleotides, and can be represented as a string over 4 different letters: A (adenine), C (cytosine), G (guanine) and U (uracil). RNA secondary structure (i.e. the two-dimensional folding of RNA molecules) can also be described as a string of symbols: the single-nucleotide symbol "." and the base-pairing symbols "(" and ")", where a pair of matching brackets indicates that the nucleotide at the position of the opening bracket base-pairs with the nucleotide at the position of the closing bracket (Figure 1a). This description makes it possible to produce RNA structures using context-free grammars.

A formal grammar in language theory consists of:

  • a set of nonterminal symbols
  • a set of terminal symbols (disjoint from the set of nonterminals)
  • a set of production rules that map one string of symbols to another.

A distinguished nonterminal symbol is the start symbol. A context-free grammar is a grammar in which every production rule maps a single nonterminal symbol into a string of terminal and nonterminal symbols.

In this project we used a context-free grammar designed by Bjarne Knudsen that can generate strings describing all RNA structures that contain no pseudoknots and have a minimum of 2 nucleotides between the two partners of any base pair. By assigning probabilities to the different rules that generate strings from the same nonterminal symbol, we obtain the stochastic context-free grammar:

  S -> LS    prob. p1
  S -> L     prob. 1 - p1
  L -> (F)   prob. p2
  L -> .     prob. 1 - p2
  F -> (F)   prob. p3
  F -> LS    prob. 1 - p3
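
To make the grammar concrete, the following minimal Python sketch computes the log-probability of a dot-bracket structure under the six rules, reading the complementary probabilities as 1 - p1, 1 - p2 and 1 - p3. Every structure has exactly one derivation under these rules, so a simple recursive parse is enough. The function names (pair_table, log_prob, classify), the grammar labels and all parameter values are assumptions made for illustration. The toy classify function only compares likelihoods under fixed parameter sets; it is not the training procedure of the paper.

```python
import math


def pair_table(structure):
    """Map each '(' index to the index of its matching ')'."""
    stack, partner = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')'")
            partner[stack.pop()] = i
        elif ch != ".":
            raise ValueError(f"unexpected symbol {ch!r}")
    if stack:
        raise ValueError("unbalanced '('")
    return partner


def log_prob(structure, p1, p2, p3):
    """Log-probability of a dot-bracket string under the S/L/F grammar."""
    partner = pair_table(structure)

    def units(i, j):
        # Split structure[i:j] into top-level units: a '.' or a whole pair.
        out = []
        while i < j:
            end = partner[i] if structure[i] == "(" else i
            out.append((i, end))
            i = end + 1
        return out

    def lp_L(start, end):
        if start == end:                      # L -> .
            return math.log(1 - p2)
        return math.log(p2) + lp_F(start + 1, end)   # L -> (F)

    def lp_S(us):
        if not us:
            return -math.inf
        # n units use S -> LS (n - 1) times and S -> L once.
        total = (len(us) - 1) * math.log(p1) + math.log(1 - p1)
        return total + sum(lp_L(a, b) for a, b in us)

    def lp_F(i, j):
        us = units(i, j)
        if len(us) == 1 and us[0][0] != us[0][1]:    # F -> (F)
            a, b = us[0]
            return math.log(p3) + lp_F(a + 1, b)
        if len(us) >= 2:                             # F -> LS
            return math.log(1 - p3) + lp_L(*us[0]) + lp_S(us[1:])
        return -math.inf   # fewer than 2 nucleotides enclosed: not derivable

    return lp_S(units(0, len(structure)))


def classify(structure, grammars):
    """Return the label of the parameter set giving the highest likelihood."""
    return max(grammars, key=lambda name: log_prob(structure, *grammars[name]))


# Hypothetical parameter sets (p1, p2, p3); the values are illustrative only.
GRAMMARS = {
    "long_stems": (0.6, 0.25, 0.9),
    "short_stems": (0.6, 0.25, 0.3),
}

if __name__ == "__main__":
    for s in ["((((...))))", "(..)", "..((..)).."]:
        print(s, classify(s, GRAMMARS))
```

Because p3 is the probability of extending a stem by one more base pair, the parameter set with the larger p3 assigns higher likelihood to structures with long stems. A structure such as "(.)" receives probability zero, which reflects the minimum of 2 nucleotides enclosed by any base pair.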

Similar resources

Pairwise RNA Structure Comparison with Stochastic Context-Free Grammars

Pairwise stochastic context-free grammars ("Pair SCFGs") are powerful tools for finding conserved RNA structures, but unconstrained alignment to Pair SCFGs is prohibitively expensive. We develop versions of the Pair SCFG dynamic programming algorithms that can be conditioned on precomputed structures, significantly reducing the time complexity of alignment. We have implemented these algorithms ...

CONTRAfold: RNA secondary structure prediction without physics-based models

Motivation: For several decades, free energy minimization methods have been the dominant strategy for single sequence RNA secondary structure prediction. More recently, stochastic context-free grammars (SCFGs) have emerged as an alternative probabilistic methodology for modeling RNA structure. Unlike physics-based methods, which rely on thousands of experimentally-measured thermodynamic paramete...

Small Subunit Ribosomal RNA Modeling Using Stochastic Context-Free Grammars

We introduce a model based on stochastic context-free grammars (SCFGs) that can construct small subunit ribosomal RNA (SSU rRNA) multiple alignments. The method takes into account both primary sequence and secondary structure basepairing interactions. We show that this method produces multiple alignments of quality close to hand edited ones and outperforms several other methods. We also introdu...

An evolutionary algorithm for stochastic context-free grammar design, with applications to RNA secondary structure prediction

Stochastic Context-Free Grammars (SCFGs) have been used widely in modelling RNA secondary structure. They were motivated by the use of Hidden Markov Models (HMMs) in protein modelling (Krogh et al., 1993). What was lacking in HMMs, though, was the ability to model long range interactions which are necessary to provide an effective model for RNA secondary structure. Thus, SCFGs, as generalisati...

Maximizing Expected Base Pair Accuracy in RNA Secondary Structure Prediction by Joining Stochastic Context-Free Grammars Method

The identification of RNA secondary structures has been among the most exciting recent developments in biology and medical science. Prediction of RNA secondary structure is a fundamental problem in computational structural biology. For several decades, free energy minimization has been the most popular method for prediction from a single sequence. It is based on a set of empirical free energy c...

A Genetic Algorithm for Grammars

Ribonucleic acid (RNA) secondary structure prediction is an important problem in molecular biology; almost as soon as RNA started to be sequenced, methods have been established to determine the structure from the sequence of nucleotides. From the secondary structure one can start to find the function of a given RNA molecule, which is of course important in understanding how the cell works, char...

Publication date: 2009